Rank Statistics in Biological Evolution 
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We present a statistical analysis of biological evolution processes. Specifically, we study the 
stochastic replication-mutation-death model where the population of a species may grow or shrink 
by birth or death, respectively, and additionally, mutations lead to the creation of new species. We 
rank the various species by the chronological order by which they originate. The average population 
Nk of the fc"' species decays algebraically with rank, ~ M^k"^ , where M is the average total 
population. The characteristic exponent = (a — 7)/(a -|- /? — 7) depends on a, (5, and 7, the 
replication, mutation, and death rates. Furthermore, the average population of aU descendants 
of the fc"' species has a universal algebraic behavior, Pk ^ M . 

PACS numbers: 87.23.Kg, 02.50.-r, 87.10.+e, 05.40.-a 



Darwin's seminal ideas provide a qualitative basis for 
understanding biological evolution. Yet, quantitative 
characterization of evolution involves formidable chal- 
lenges. For example, there is a large uncertainty in the 
total number of species existing on Earth, with estimates 
ranging from 2 to 100 million Furthermore, only a 
small fraction of all of the extinct species is presently 
known 2] . Biological evolution remains one of the deep- 
est and most elusive problems in science. 

Fossil records 0,0 and molecular sequences data 0,0 
provide quantitative clues to understanding evolution. 
They are widely used to chronicle evolution processes us- 
ing "trees of life" 0, evolutionary trees that document 
how species, viruses, humans, etc., are related 0, 0|. 
Reconstruction of evolutionary trees from existing se- 
quences is a challenging problem because only par- 
tial information is available, and because one has to de- 
termine the most plausible evolution history from an ex- 
ponentially large number of scenarios. But the most sig- 
nificant difficulty is that the evolution laws themselves 
are unknown. In practice, tree reconstruction methods 
rely heavily on simplified evolution models: branching 
processes |lO|, [llj, [12] incorporating elementary processes 
such as replication, mutation, death, etc. 

We study the standard Replication-Mutation-Death 
(RMD) process that has been widely used to model speci- 
ation 1 3 , populati on gene tics Il3l. a r id g enome evolution 
IilEllll[HElESMl2il2lM& TheRMO pro- 
cess incorporates the minimal mechanisms for evolution: 
a family of species grows and shrinks by natural birth 
and death, respectively. Additionally, mutations create 
new families of species. 

We focus on the rank, namely, the chronological order 
by which the species are created. We study the total 
population size and the total descendant population size 
of a species of a given rank. Our main result is that 
both these quantities decay algebraically with the rank. 
While the scaling law characterizing the population size 



depends on the details of the evolution process, the scal- 
ing law characterizing the descendant population size is 
universal. We conclude that the chronological rank pro- 
vides a useful characteristic of evolution. 

The Replication-Mutation-Death (RMD) process is de- 
fined as follows. At any given time there are multiple 
families of distinct species and each species family has a 
certain size. Evolution proceeds as organisms (i) repli- 
cate: give birth to an identical child, (ii) mutate: give 
birth to a mutant, or (iii) die: are removed due to death. 
Replication events increase the size of the corresponding 
species family by one, and similarly, death events reduce 
the family size by one. In this minimal model, the popu- 
lation is asexual and therefore, replication mimics asex- 
ual reproduction. The key feature of this model is that 
each mutation event creates a distinct new species. All 
three processes are completely random and independent 
of each other. Initially, there is only one organism, and 
hence, only one species. We term the first species, the 
grand ancestor. 

Let a, P, and 7 be the rates of replication, mutation, 
and death, respectively. The total growth rate must of 
course be positive, a -I- /? — 7 > 0. Moreover, we restrict 
our attention to the biologically relevant case where the 
replication rate exceeds the death rate, a > 7. The 
most elementary characteristic is the total population 
size. The average total population size M grows accord- 
ing to dM/dt = {a + P- 7)M, and given M(0) = 1, the 
total population increases exponentially with time 
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Throughout this study, averaging is taken with respect to 
infinitely many independent realizations of the stochastic 
process. 

The RMD evolution process is illustrated in Fig. 1. 
One natural characteristic is the "distance" , the number 
of mutation events that separate a descendant and its 
ancestor. In Fig. 1, the distance between species 3 and 
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FIG. 1: Illustration of the replication- mutation-death process 
with a total of 4 different families of species and a total pop- 
ulation of 7 (here, there are no death events). The rank of 
each species is indicated. 



species 1 equals 2 and the distance between species 4 and 
species 1 is 1. Let Gn be the total population with dis- 
tance n from the grand ancestor. This quantity evolves 
according to 



dt 



= (a - -f) Gn + P Gn- 



(2) 



The initial condition is G„(0) = 6n,o. Equation ^ re- 
flects that the distance is augmented by one with each 
mutation. It is convenient to normalize Gn by the total 
population size, G„ — M gn- The quantity g„ satisfies 
dgn/d{l3t) = gn-i - gn with 5„(0) = S^fi- Solving this 
equation, the probability distribution gn is Poissonian, 
g^{t) = e"^7n!, and therefore 



(3) 



The Poissonian distribution of distance reflects the com- 
pletely random nature of the evolution process. 

Let S(t) be the average number of mutation events 
until time t. This quantity follows easily from the average 
population size, 
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For the initial condition 5(0) = 0, the average number of 
mutation events, or equivalently, the average number of 
distinct species generated is 



a + P — J 
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(5) 



Therefore, the total number of distinct species originated 
throughout the evolution quickly becomes of the same 
order as the total population size: S/M /3/{a + (3 — j) 
for t ^ l/{a + j3 — We note that the total number of 



existing species is generally smaller than the total number 
of species created because some species become extinct. 

The chronological order by which the species are cre- 
ated is a useful way to characterize the evolution process 
(Fig. 1). The grand ancestor has the index k = 1, the sec- 
ond species (generated by the flrst mutation event) has 
the index = 2, etc. We term this chronological index, 
the rank. With this deflnition, the rank of an ancestor 
is always smaller or equal than the rank of any of its de- 
scendants. We note that different deflnitions of rank are 
used to characterize data storage trees in computer sci- 
ence and river networks in geophysics |2^ . Statistics 
of the maximal rank are characterized by fk, the proba- 
bility that the total number of distinct species originated 
up to time t equals k. The distribution fk satisfies the 
evolution equation 



^=PMif,_,-f,) 
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The initial condition is /fe(0) — Sk,i and the boundary 
condition is fo{t) = 0. This equation reflects that ev- 
ery mutation event creates a new species. We stress that 
this evolution equation is not exact: it approximates the 
total population size, a fluctuating quantity, by its aver- 
age value [i^. In other words, it assumes that the total 
population size and the total number of distinct species 
are not correlated. Nevertheless, as shown below, this 
is an excellent approximation. Using Eq. I^J, the over- 
all multiplicative factor f3M in Eq. ^ is eliminated by 
redefining the time variable, dfk/dS — fk-i — fk- The 
resulting probability distribution is Poissonian, 
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again reflecting the random nature of the evolution pro- 
cess. By construction, the first two moments are exact, 
Efe /fe = 1 and Efc fc/fe = 1 + S. 

The distribution fk enables us to determine various 
statistical properties of the rank. A natural question is: 
what is the population of a species of a given rank? For 
example, in figure 1, the population of the first species 
equals 3 and the population of the second species equals 
2. Consider Nk, the average population size of the k^^ 
species. It evolves according to 



dNk 
dt 



= (a - Nk + 13 M fk-i 



(8) 



with the initial condition Nk{0) = Sk,i- The first term 
on the right-hand side accounts for replication and death 
while the second term describes mutations. As in Eq. ((HJ, 
this equation is approximate: in writing the second term, 
we assumed that the total population size and the to- 
tal number of species are not correlated. Writing Nk 



= (a-7)t 



Uk transforms Eq. into duk/dt = (3e^^ fk-\ 
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For the grand ancestor ni = 1, and for fc > 2 



(9) 



To obtain the large-rank behavior {k ^ 1) for large 
populations (t ^ 1) we use asymptotic analysis 
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with ^ ~ (a — 7) /(a + /3 — 7). The second line was 
obtained by substituting jSJ) and {T)) into © and the third 
line was obtained by replacing the upper integration limit 
by infinity because S diverges rapidly. The integral is 
estimated by the steepest descent method using the fact 
that the integrand is maximal for x k, k. Therefore, the 
average total population of a species decays algebraically 
with its rank (Fig. 2) 



Ni, ~ Ml" k- 



for fc ;» 1. The characteristic exponent 
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a + /3 - 7' 



(10) 



(11) 



is positive, /i > 0, since the replication rate exceeds the 
death rate. For the special case of no-death, 7 = 0, it is 
possible to derive the power-law behavior (fTm) - (fTT|l ex- 
actly by using the total population rather than time to 
characterize the evolution (this leads to exact difference 
evolution equations rather than the differential equation 
© nil)- When the replication rate approaches the death 
rate, the population size becomes rank-independent. The 
characteristic exponent < /i < 1 varies continuously 
with the three rates. 

To test the theoretical predictions, we performed 
Monte Carlo simulations. We focus on the no-death case 
(7 = 0) because it can be simulated efficiently, thereby 
allowing us to generate populations of size up to 10 ''. The 
simulation is straightforward: an organism is picked at 
random. Then, with probability a/ {a + (i) an identical 
organism is created (replication), while with probability 
(i/{a + (i), a new species is created (mutation). The sim- 
ulation results are in excellent agreement with the theo- 
retical results. For example, for the case a = /3, the ex- 
ponent agrees with the theoretical prediction /i = 1/2 to 
within 0.1%. We also simulated the case where the death 
rate is finite and the agreement between the simulations 
and the theoretical predictions concerning the exponent 
/i was equally strong. We conclude that the numerical 
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FIG. 2: The normalized size distribution Uk — M~^^'^Nk ver- 
sus fc, obtained from 10^ independent Monte Carlo realization 
for a total population of size 10''. The rates are a = /3 and 
7 = 0. Also shown for comparison is the theoretical prediction 
rife ~A;-i/^ 



simulations indicate that even though the rate equations 
and (jSj) are approximate, they yield asymptotically 
exact results for the large-rank behavior. In particular, 
the predicted exponent ^ appears to be exact. 

We also noticed that in a single realization, Nk devi- 
ates significantly from the theoretical prediction, indicat- 
ing that there are strong sample-to-sample fiuctuations. 
This is not surprising because the populations are grow- 
ing exponentially. To extract meaningful results from a 
single realization, it is necessary to study statistics of the 
variable z = Ink. Using this exponentially-distributed 
variable, the large fluctuations are suppressed, and the 
power-law behavior H10|l is clear. 

A closely-related quantity is the total descendant pop- 
ulation of a species, that is, the total population of all 
species that emanated from a given species. For example, 
in figure 1, the total descendant population of the first 
species is 7 and that of the second species is 3. Fixing the 
fc*'' species as a common ancestor, we denote by Pk the 
average size of its descendant population. This quantity 
satisfies the initial condition Pk{0) — Sk,i and it evolves 
according to 

^^{a + P-j)Pk+f3Mfk-i. (12) 
dt 

Since a descendant of a descendant is also a descendant, 
the growth rate now accounts for mutation too, in con- 
trast with Eq. 0. The transformation Pk = M pk re- 
casts Eq. (fT^ into dpk/dt = (3fk-i- Since every species is 
the descendant of the grand ancestor, pk = 1. Otherwise, 
for fc > 2 one has 



Pkit) = P / drfk-iir). 
Jo 



(13) 



The quantity pk is of course the probability that a 
randomly selected organism is a descendant of the k^^ 
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FIG. 3: The descendant size distribution. Plotted is pk = 
M~^Pk versus k, obtained from 10^ independent Monte Carlo 
realization for a population of size M = 10^. The death 
rate is zero (7 = 0), while replication and mutation rates are 
equal (a = /3). Also shown for comparison is the theoretical 
prediction pk ^ . 



species. Repeating the asymptotic analysis that has been 
appUed to Eq. we find that Pk{t) reaches an asymp- 
totic value, Pfc(t) — > pk{oo) as t 00, with 
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dx 
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1-M 



In tlie large-ranlc limit we have pk k^^, and therefore 

Pk^Mk'^ (14) 

(Fig. 3). Thus, the descendant size distribution obeys a 
universal law as the characteristic exponent is indepen- 
dent of the various rates. 

In summary, we obtained scaling laws that character- 
ize how the population of a species and its descendant 
population depend on the rank. The average popula- 
tion of a species decays algebraically with the rank fc, 
Nk k^^, with the characteristic exponent ^ dependent 
on the rates of rephcation, mutation, and death. The 
average descendant population of a species decays in a 
universal fashion with the rank k, ~ k~^. Our main 
conclusion is that the chronological order of origination, 
or the rank, provides a useful characterization of biolog- 
ical evolution processes. 

Several aspects of this model should be investigated 
further. Obtaining the exact behavior by properly ac- 
counting for the coupling between the total population 
size and the total number of species is a challenging prob- 
lem. We focused on the average behavior, but statisti- 
cal fluctuations must exhibit interesting behavior because 
the population is exponentially-growing. 

Power-law distributions with different rate-dependent 
exponents were found for another quantity, the family 
size distribution, for the very same replication-mutation- 
death processes [1^ llM l23 Ull • In general, the var- 



ious rates can not be measured directly. Since the char- 
acteristic exponents yield a constraint for the rates, such 
scaling-laws are useful since they reduce the number of 
unknown parameters in the problem. 

We acknowledge US DOE grant W-7405-ENG-36 for 
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